Spatially aware dimension reduction for spatial transcriptomics

Spatial transcriptomics are a collection of genomic technologies that have enabled transcriptomic profiling on tissues with spatial localization information. Analyzing spatial transcriptomic data is computationally challenging, as the data collected from various spatial transcriptomic technologies are often noisy and display substantial spatial correlation across tissue locations. Here, we develop a spatially-aware dimension reduction method, SpatialPCA, that can extract a low dimensional representation of the spatial transcriptomics data with biological signal and preserved spatial correlation structure, thus unlocking many existing computational tools previously developed in single-cell RNAseq studies for tailored analysis of spatial transcriptomics. We illustrate the benefits of SpatialPCA for spatial domain detection and explores its utility for trajectory inference on the tissue and for high-resolution spatial map construction. In the real data applications, SpatialPCA identifies key molecular and immunological signatures in a detected tumor surrounding microenvironment, including a tertiary lymphoid structure that shapes the gradual transcriptomic transition during tumorigenesis and metastasis. In addition, SpatialPCA detects the past neuronal developmental history that underlies the current transcriptomic landscape across tissue locations in the cortex.

In the above low-rank approximation, we choose the rank r as a function of sample size. Specifically, for data with a large sample size (n>5,000), we evaluated quantities in equations (6) and (8) in each iteration using a small r=20, since these quantities are insensitive to the choice of rank r. We obtained the estimates B ; in equation (10) using a relatively large r, with r set to be 10% of the sample size n, which ensures the top r eigen values to explain at least 90% of the variance in the present study. For data with a small sample size (n≤5,000), we use the same r

Supplementary Figures
Supplementary Figure 1. Simulation results for spatial domain clustering and cell type clustering in single cell resolution. a. The mean and variance relationship between real scRNA-seq count data and simulated single cell count data are consistent. b. Spatial domain clustering results using different methods paired with spatially variable genes (SVGs), highly variable genes (HVGs) and all genes in four simulation scenarios (n=10,000 cells). c. Cell type clustering results using different methods (n=10,000 cells). In SpatialPCA, we aim to identify spatial domains. In the simulations, SpatialPCA has highest adjusted Rand index (ARI, the higher the better) in spatial domain clustering and lowest ARI in cell type clustering, highlights the different goals in spatial domain and cell type detection. d. After controlling for cell types as covariates in SpatialPCA, the clustering performance is the best when there is one dominant cell type in each spatial domain (scenario 1), and the performance reduces when there are multiple cell types mixed in each spatial domain (scenario 2, 3 and 4, sample size n=10,000 cells). In the boxplots in b-d, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively. These results highlight the spatial domains are driven by the cell type compositions.

Supplementary Figure 2.
Simulation results for spatial domain clustering in single cell resolution. Spatial domain clustering results using different methods paired with SVGs, HVGs and all genes in four simulation scenarios (n=10,000 cells), in terms of a. Normalized mutual information (NMI, the higher the better), b.  Figure 5. Simulation results for spatial domain clustering at spot level with artifactual spatial correlation between spots. a-b. Spatial domain clustering results using different methods at spot diameter being 90um (n=5,077 spots) in terms of adjusted Rand index (ARI, the higher the better) and spatial chaos score (CHAOS, the lower the better). c-d. Spatial domain clustering results using different methods at spot diameter being 107um (n=3,602 spots) in terms of ARI and CHAOS. e-f. Spatial domain clustering results using different methods at spot diameter being 145um (n=1,948 spots) in terms of ARI and CHAOS. In the boxplots in a-f, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively.
Supplementary Figure 6. High resolution spatial map reconstruction and gene expression imputation simulation. a. High resolution spatial map clustering results for spot level simulation in four scenarios at spot diameter being 145um (n=1,948 spots). We compared SpatialPCA with BayesSpace in terms of adjusted Rand index (ARI, the higher the better), normalized mutual information (NMI, the higher the better), percentage of abnormal spots (PAS, the lower the better) and spatial chaos score (CHAOS, the lower the better). b. High resolution gene expression prediction results for spot level simulation in four scenarios at spot diameter being 145um (n=1,948 spots). We compared SpatialPCA with BayesSpace for the Pearson's correlation between predicted gene expression with ground truth expression. In the boxplots in a-b, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively. L a y e r1 L a y e r2 L a y e r3 L a y e r4 L a y e r5 L a y e r6 W M L a y e r1 L a y e r2 L a y e r3 L a y e r4 L a y e r5 L a y e r6 W M L a y e r1 L a y e r2 L a y e r3 L a y e r4 L a y e r5 L a y e r6 W M Setting 2: using top 1000 spatially variable genes detected by SPARK. Setting 3: using all spatially variable genes detected by SPARK. Setting 4: using all highly variable genes detected by Seurat. Setting 5: using top 10 Spatial PCs. Setting 6: using top 20 Spatial PCs. Setting 7: using top 30 Spatial PCs. Setting 8: using top 50 Spatial PCs. Setting 9: using Gaussian kernel. Setting 10: using Cauchy kernel. Setting 11: using quadratic kernel. Setting 12: controlling for cell types when selecting SVGs in SPARK-X. Setting 13: controlling for cell types in SpatialPCA. Setting 14: controlling for cell types by regressing them out from the input gene expression and taking the residuals. Setting 15: controlling for cell density in the spots. Setting 16: gene expression normalized through SCTransform normalization. Setting 17: gene expression normalized through log normalization. Setting 18: taking the histology information as a third dimension in location matrix. b. Clustering accuracy as measured by adjusted Rand index (ARI, the higher the better) for different settings. c. Clustering accuracy as measured by normalized mutual information (NMI, the higher the better) for different settings. d. Spatial continuity of the inferred clusters as measured by percentage of abnormal spots (PAS, the lower the better) for different settings. e. Spatial continuity of the inferred clusters as measured by spatial chaos score (CHAOS, the lower the better) for different settings.
Supplementary Figure 10. Additional sensitivity analyses for the DLPFC data. a. The spatial clustering results from SpatialPCA and SpaGCN as measured by adjusted Rand index (ARI, the higher the better, y-axis) across different values of the bandwidth parameter (x-axis) in sample 151676. Results are obtained using the Gaussian kernel, with the red dash line representing the ARI value at the default bandwidth obtained from the SJ method. b. Left: visualization of the spatial domains detected in SpatialPCA with a distance matrix calculated from Delaunay triangulation. Right: comparison of spatial clustering accuracy measured by ARI for using Delaunay distance versus using the Euclidian distance. c. Left: spatial clusters obtained by jointly modeling multiple tissue sections (samples 151507, 151069, and 151673) using SpatialPCA. Right: comparison of spatial clustering accuracy by ARI for joint modeling of multiple tissue sections versus modeling each section separately.
Supplementary Figure 11. Clustering results obtained based on different methods in the DLPFC data. a. Clustering results measured by adjusted Rand index (ARI, the higher the better) in all 12 sections. In dimension reduction methods (SpatialPCA, PCA, and NMF), clustering was performed based on the inferred lowdimensional components. For spatial domain clustering methods (BayesSpace, SpaGCN and HMRF), clustering was performed based the default settings. All the methods are paired with SVGs, HVGs and all genes. b-d. Clustering results measured by normalized mutual information (NMI, the higher the better), percentage of abnormal spots (PAS, the lower the better), and local inverse Simpson's index (LISI, the lower the better) in different methods with their default settings in all 12 sections. Clustering results of PCA and NMF are obtained with SVGs. In the boxplots in a-d, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively. clusters. Setting 4: using top 500 spatially variable genes detected by SPARK-X. Setting 5: using top all spatially variable genes detected by SPARK-X. Setting 6: using all spatially variable genes detected by SPARK. Setting 7: using all highly variable genes detected by Seurat. Setting 8: using Gaussian kernel. Setting 9: using Cauchy kernel. Setting 10: using quadratic kernel. Setting 11: using top 10 Spatial PCs. Setting 12: using top 20 Spatial PCs. Setting 13: using top 30 Spatial PCs. Setting 14: controlling for cell types when selecting SVGs in SPARK-X. Setting 15: controlling for cell types in SpatialPCA. Setting 16: controlling for cell types by regressing them out from the input gene expression and take the residuals. Setting 15: controlling for cell density in the spots. Setting 17: gene expression normalized through SCTransform normalization. Setting 18: gene expression normalized through log normalization. b. Spatial continuity of the inferred clusters as measured by percentage of abnormal spots (PAS, the lower the better) for different settings. c. Spatial continuity of the inferred clusters as measured by spatial chaos score (CHAOS, the lower the better) for different settings.

Supplementary Figure 21. Gene set enrichment analysis on the region-specific genes in the Slide-seq data.
The top 10 enriched gene sets are shown for each of the eight detected tissue regions. Color represents different data sources for annotating the gene sets. The enrichment is given as -log10 adjusted p-value (g:SCS correction, details in Methods) of the region specific genes.    Figure 24. RGB plots for the Slide-seq V2 data. a-f. For SpatialPCA, PCA, and NMF, we summarized the inferred low dimensional components into three tSNE (a, c, e) or UMAP components (b, d, f) and visualized the three resulting components with red/green/blue (RGB) colors through the RGB plot. Color code corresponds to the RGB values of each location's three UMAP or tSNE components inferred from low dimensional components in dimension reduction. Different colors indicate different values for each of the three UMAP or tSNE components on the tissue section, highlighting the difference of the low dimensional components from different methods included in the panel. The RGB plot from SpatialPCA displays tissue structure organization of the hippocampus region and show less color differences within a local area. We also scaled up spatial PCs/regular PCs 10 times (c-d) and 20 times (e-f) to see the influence of range of the PCs to RGB plots, the tSNE/UMAP results and RGB plots have very similar patterns as shown at the original scale (a-b). g-h. The weighted RGB values in SpatialPCA have lower variance than PCA or NMF in nearby spots (n=51,398 locations). In the boxplots in g-h, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively. i-j. Spatial continuity of the inferred clusters as measured by percentage of abnormal spots (PAS, the lower the better) and spatial chaos score (CHAOS, the lower the better) in SpatialPCA is the lowest compared with BayesSpace, SpaGCN, PCA and NMF. clustering results obtained using all highly variable genes detected by Seurat. Setting 8: clustering results obtained using top 10 Spatial PCs. Setting 9: clustering results obtained using top 20 Spatial PCs. Setting 10: clustering results obtained using top 30 Spatial PCs. Setting 11: clustering results obtained by controlling cell types in SpatialPCA. Setting 12: clustering results obtained through controlling cell types by regressing them out from the input gene expression and take the residuals. Setting 13: clustering results obtained by controlling for cell density in the spots. Setting 14: clustering results obtained by SpatialPCA with gene expression normalized through SCTransform normalization. Setting 15: clustering results obtained by SpatialPCA with gene expression normalized through log normalization. b. Clustering spatial continuity measured by percentage of abnormal spots (PAS, the lower the better) in different settings. c. Clustering spatial continuity measured by spatial chaos score (CHAOS, the lower the better) in different settings. C lu s te r1 C lu s te r2 C lu s te r3 C lu s te r4 C lu s te r5 C lu s te r6 C lu s te r7 C lu s te r8 C lu s te r9 C lu s te r1 0 C lu s te r1 1 C lu s te r1 2 C lu s te r1 3 C lu s te r1 4 SpatialPCA 0 25 50 75 100 C lu s te r1 C lu s te r2 C lu s te r3 C lu s te r4 C lu s te r5 C lu s te r6 C lu s te r7 C lu s te r8 C lu s te r9 C lu s te r1 0 C lu s te r1 1 C lu s te r1 2 C lu s te r1 3 C lu s te r1 4   (n=13,195 locations). In the boxplot, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively. c. Visualizaton of the inferred pseudo-time for seven trajectories (from left to right) in cortical layers of the Slide-seq V2 data in PCA. d. Visualizaton of the inferred pseudo-time for five trajectories (from left to right) in cortical layers of the Slide-seq V2 data in NMF. Figure 32. RGB plots for the ST data. a-f. For SpatialPCA, PCA, and NMF, we summarized the inferred low dimensional components into three UMAP components (a, c, e) and tSNE (b, d, f) and visualized the three resulting components with red/green/blue (RGB) colors through the RGB plot. Color code corresponds to the RGB values of each location's three UMAP or tSNE components inferred from low dimensional components in dimension reduction. Different colors indicate different values for each of the three UMAP or tSNE components on the tissue section, highlighting the difference of the low dimensional components from different methods included in the panel. The RGB plot from SpatialPCA displays tissue structure organization of the breast tumor and show less color differences within a local area. We also scaled up spatial PCs/regular PCs 10 times (c-d) and 20 times (e-f) to see the influence of range of the PCs to RGB plots, the tSNE/UMAP results and RGB plots have very similar patterns as shown at the original scale (a-b). g-h. The weighted RGB values in SpatialPCA have lower variance than PCA or NMF in nearby spots (n=607 spots). In the boxplots in g-h, the center line, box limits and whiskers denote the median, upper and lower quartiles, and 1.5× interquartile range, respectively.